
KEP-5677: DRA Resource Availability Visibility#5749

Merged
k8s-ci-robot merged 10 commits into kubernetes:master from
nmn3m:kep-5677-dra-resource-availability-visibility
Feb 10, 2026

Conversation

@nmn3m
Member

@nmn3m nmn3m commented Dec 23, 2025

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Dec 23, 2025
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 23, 2025
@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from ca95081 to d9ac678 Compare December 29, 2025 23:26
@nmn3m nmn3m marked this pull request as ready for review December 29, 2025 23:31
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 29, 2025
@k8s-ci-robot
Contributor

@nmn3m: GitHub didn't allow me to request PR reviews from the following users: kubernetes/sig-scheduling, kubernetes/sig-node, kubernetes/sig-cli, kubernetes/wg-device-management.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @kubernetes/sig-scheduling
/cc @kubernetes/sig-node
/cc @kubernetes/sig-cli
/cc @kubernetes/wg-device-management

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from d9ac678 to 495b6cb Compare December 29, 2025 23:36
@nmn3m
Member Author

nmn3m commented Dec 29, 2025

/cc @johnbelamaric
/cc @pohly

@nmn3m
Member Author

nmn3m commented Dec 29, 2025

/cc @kubernetes/sig-cli-kubectl-maintainers

@mortent
Member

mortent commented Jan 6, 2026

/wg device-management

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jan 6, 2026
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jan 7, 2026
Member

@johnbelamaric johnbelamaric left a comment

First pass, this is looking really really good to me so far

@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from 495b6cb to fdbf949 Compare January 14, 2026 22:37
@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from dd8b15d to f26f289 Compare February 10, 2026 14:18
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2026
@liggitt
Member

liggitt commented Feb 10, 2026

> @liggitt WDYT?
>
> An always available, in-tree approach is preferable, but I am not sure there's a great option for that. Here's what I could think of:
>
>   1. A "request" is made by creating a ResourcePool object, or maybe it's even called something like "ResourcePoolStatusRequest". A controller runs in KCM that sees that request, makes the calculation, and writes the result to the object's status, where it can be observed by the user. It is a one-time operation with a timestamp; to recalculate, the user has to delete and recreate the object. The object should probably be cluster scoped.
>   2. A specialized API endpoint built into the API server. I suspect this is a no-go. Jordan, is there any precedent for that?
>   3. A specialized API endpoint in KCM that is then exposed via an aggregated API configuration. Jordan, any precedent?
>
> The first one seems promising if we want to do this in-tree. I actually think it's fine, and there is precedent for similar "imperative operations through declarative APIs" with things like CSR, and even the way device taints with the "None" effect work. It also gives us the ability to control permissions on the object.
>
> For out-of-tree (could be in k-sigs), we could implement JUST a kubectl plugin and rely on user permissions to start, and add an aggregated API server later if we see the need.
>
> The advantage of in-tree: always available and in sync with Kubernetes releases, so all users can depend on it. Disadvantage: locked to the Kubernetes release cycle.
>
> The advantage of out-of-tree: we can implement it independently of the release cycle.
>
> My preference: the first in-tree option.

Of those three options, the first seems the best to me as well. If it's an object we intend to be created, waited for, read, and then deleted only to get a view of status, making it separate from a general ResourcePool type is a good idea.

We should also define the behavior when multiple of these exist at the same time for the same pool (e.g. the controller calculates once and then fills all of them).

Member

@liggitt liggitt left a comment

some questions about the filtering / scale / limit bits, but those seem ok to pin down in implementation review as well

Comment on lines +547 to +548
| `driver` | Filter by driver name (optional) |
| `poolName` | Filter by pool name (optional, requires driver) |
Member

making both of these optional means status is very unbounded ... should at least driver be required?

Member Author

I think that would not be a problem.
@johnbelamaric WDYT?

Member

sure, that's fine

|-------|-------------|
| `driver` | Filter by driver name (optional) |
| `poolName` | Filter by pool name (optional, requires driver) |
| `limit` | Max pools to return (default: 100, max: 1000) |
Member

I expected a request structure that would not result in unbounded status that would require limits like this

Member

I'm also not sure where the 100 / 1000 limits came from ... with ResourceSlice, we've been really specific about the maximum size possible if all fields / lists are at their maximum size, to be sure the resulting resource could actually be persisted. Was that done here?

Member Author

No, the rigorous max-size calculation was not done. The 100/1000 numbers were chosen based on patterns in other K8s APIs, not from first principles. If we keep a limit field, we should do the proper calculation.

Alternatively, if we make driver required, the response becomes naturally bounded and the limit field may not be needed.

@liggitt WDYT?

Member

there could still be a LOT for a given driver. I think we should have a limit. If we can calculate that now that's great, but we can also defer it to implementation time.

Member

if we need a limit, make sure it is principled, and consider whether we need to make it user-specifiable, and consider whether the use cases we intend will break if a truncated response is received (e.g. an autoscaler couldn't use truncated info, right?)

@mrunalp
Contributor

mrunalp commented Feb 10, 2026

/approve
for sig-node. Thanks for evaluating the various alternatives in reaching this design!
/hold

@johnbelamaric Please cancel the hold once ready. Thanks!

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 10, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kannon92, mrunalp, nmn3m

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2026
…ased on size calculations

Signed-off-by: Nour <nurmn3m@gmail.com>
@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from f26f289 to a3d1151 Compare February 10, 2026 19:08
@johnbelamaric
Member

/hold cancel

Thank you!

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 10, 2026
@johnbelamaric
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2026
@k8s-ci-robot k8s-ci-robot merged commit f48c046 into kubernetes:master Feb 10, 2026
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Feb 10, 2026
@liggitt
Member

liggitt commented Feb 12, 2026

(post-merge note ... we might want to consider automatically deleting these via the controller after a fixed time period after creation / population ... we do that with other resources like certificate requests, etc ... can discuss during implementation)


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

Projects

Archived in project
Status: ✅ Done